Address the data science problem
Since the one large data science problem can be divided into several smaller parts, we combine the analysis and interpretation sections, and the questions are addressed and discussed one by one.
The following questions are about data science career paths.
What percentage of the survey respondents are working under these job titles?
What are the statistics on salaries for these job titles?
The US federal minimum wage is \$7.25 per hour; multiplying by an 8-hour workday and a typical \(260\) workdays a year gives:
\[7.25 \frac{\$}{\text{hour}} \cdot 8 \frac{\text{hours}}{\text{day}} \cdot 260\frac{\text{days}}{\text{year}} = 15080.\]
The new categories* for salaries will be:

- poverty: below the federal minimum wage
- low: 15,000 to 49,999
- medium: 50,000 to 99,999
- high: 100,000 to 199,999
- very high: 200,000 to 499,999
- highest: 500,000 and above
*loosely based on US federal income tax brackets.
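Given a numeric annual salary, the brackets above can be assigned with a single `cut()` call. This is a compact sketch, not the exact level-by-level recode used in the code appendix; the 15,000 break approximates the annualized minimum wage of 15,080.

```r
# Sketch: map a numeric annual salary to the bracket labels above.
# The 15,000 break approximates the annualized minimum wage (~15,080).
salary_bracket <- function(salary) {
  cut(salary,
      breaks = c(-Inf, 15000, 50000, 100000, 200000, 500000, Inf),
      labels = c("poverty", "low", "medium", "high", "very high", "highest"),
      right = FALSE, ordered_result = TRUE)
}

as.character(salary_bracket(c(10000, 75000, 600000)))
# "poverty" "medium" "highest"
```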
A chi-squared test of independence can be used to test whether the distribution of counts over salary categories differs significantly by job title. First, the test is run on the data-oriented data science roles; see below:
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: q2_table.1
## X-squared = 180.52, df = NA, p-value = 9.999e-05
Qualitatively, it can be seen that each job title has a seemingly different categorical distribution over the salary brackets. The chi-squared test using simulated p-values confirms that this difference is statistically significant, with a p-value of \(9.999 \times 10^{-5}\). At an alpha level of 0.05, the null hypothesis that the frequency of counts over salary brackets is independent of job title is rejected.
Because the original salary data is categorical rather than numerical, a two-sample t-test of the difference in mean salary is not possible. But the visualization above reveals that for data scientists and data engineers, more people fall in the medium and high salary brackets than in the low brackets; the opposite is true for business analysts and data analysts. The visualization alone suggests that "analyst" is considered a more junior role than "engineer" or "data scientist". There are too few DBA/database engineers and statisticians in the dataset to draw a conclusion for those categories.
It is also worth noting that most of the highest earning (\(>\$500,000\)) people in the dataset are data scientists.
Below, the analysis turns to job titles oriented toward software and product management, such as machine learning engineer (MLE), software engineer (SWE), and product manager (PM).
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: q2_table.2
## X-squared = 39.402, df = NA, p-value = 0.0011
As can be seen above, more MLEs command the highest category of salary than any other job title, and the average MLE earns at or above the medium salary bracket. As for PMs, very few earn a low salary; in fact, most are in the medium to very high income brackets. The salary distribution for program/project managers is similar to that of PMs, except there are significantly more professionals in the low income bracket. Last but not least, most SWEs command a medium to high salary, with a significant number of professionals earning outside this range on both the high end and the low end.
What levels of education are required for these job titles?
##
## Master’s degree
## 0.466037736
## Bachelor’s degree
## 0.269433962
## Doctoral degree
## 0.164528302
## Some college/university study without earning a bachelor’s degree
## 0.064905660
## Professional doctorate
## 0.016981132
## I prefer not to answer
## 0.011320755
## No formal education past high school
## 0.006792453
At an aggregate level, it is evident that more than half of the data science community on Kaggle holds a degree above the undergraduate level. Combining the proportions of respondents with a master's degree or a doctoral degree (including professional doctorates) gives \(64.75\%\) holding a graduate degree. Given the programming skill requirements and the amount of requisite knowledge in statistics and machine learning, this is expected.
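The graduate-degree share quoted above is just the sum of three of the proportions printed in the table:

```r
# Proportions taken from the education table above
grad_share <- 0.466037736 +  # Master's degree
  0.164528302 +              # Doctoral degree
  0.016981132                # Professional doctorate

round(100 * grad_share, 2)  # 64.75
```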
Is there a significant income gap between genders for these jobs?
##
## poverty low medium high very high highest
## Man 125 108 369 584 187 28
## Nonbinary 1 5 5 5 0 1
## Prefer not to say 11 4 4 12 6 1
## Prefer to self-describe 1 1 0 1 0 0
## Woman 68 57 118 122 19 2
As seen above at an aggregate level, for salary levels high and above there are a total of 799 men and only 143 women. The gender gap is apparent.
## [1] "< 1 years"
##
## poverty low medium high very high highest
## Man 13 20 46 17 4 0
## Nonbinary 0 1 0 0 0 0
## Prefer not to say 0 1 0 0 0 0
## Prefer to self-describe 0 1 0 0 0 0
## Woman 14 15 21 8 3 0
## [1] "10-20 years"
##
## poverty low medium high very high highest
## Man 10 7 34 142 49 11
## Nonbinary 0 0 0 1 0 0
## Prefer not to say 1 1 1 1 1 0
## Woman 6 5 6 24 7 1
## [1] "1-3 years"
##
## poverty low medium high very high highest
## Man 24 24 81 50 9 1
## Nonbinary 0 2 2 0 0 1
## Prefer not to say 3 1 1 0 0 0
## Prefer to self-describe 1 0 0 1 0 0
## Woman 16 10 29 13 0 0
## [1] "5-10 years"
##
## poverty low medium high very high highest
## Man 16 17 67 140 31 3
## Nonbinary 1 0 1 1 0 0
## Prefer not to say 1 1 0 6 2 0
## Woman 7 6 12 29 5 0
## [1] "20+ years"
##
## poverty low medium high very high highest
## Man 34 12 37 160 82 12
## Nonbinary 0 0 1 2 0 0
## Prefer not to say 5 0 1 3 3 1
## Woman 6 2 8 21 4 1
## [1] "3-5 years"
##
## poverty low medium high very high highest
## Man 17 18 76 60 9 1
## Nonbinary 0 1 1 1 0 0
## Prefer not to say 0 0 1 1 0 0
## Prefer to self-describe 0 0 0 0 0 0
## Woman 11 5 21 22 0 0
What is the typical skill set for these jobs? How does it affect the pay rate?
Here a key skill is defined as a skill that has been acquired by more than 10% of the respondents under a given job title.
From the table, the huge salary variances make it very hard to tell whether a given skill increases salary or not.
Is there a certain correlation between industry and the need for these jobs?
A prop.table version of the industry counts is shown below.
## # A tibble: 5 × 6
## Title Academics Computers Finance Internet Medical
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Data Analyst 0.4 0.105 0.258 0.105 0.214
## 2 Data Engineer 0.047 0.091 0.129 0.105 0.071
## 3 Data Scientist 0.353 0.338 0.387 0.386 0.459
## 4 Machine Learning Engineer 0.082 0.15 0.056 0.211 0.071
## 5 Software Engineer 0.118 0.317 0.169 0.193 0.184
Some hypotheses to examine using prop.test:
Hypothesis: Data Analyst is more needed in the Academics industry.

- vs. Computers: X-squared = 38.1, p-value = 3.28e-10, 95% CI = (0.2, 1), H0 rejected.
- vs. Finance: X-squared = 4.1, p-value = 2.19e-02, 95% CI = (0.02, 1), H0 rejected.
- vs. Internet: X-squared = 13.2, p-value = 1.38e-04, 95% CI = (0.17, 1), H0 rejected.
- vs. Medical: X-squared = 6.6, p-value = 5.07e-03, 95% CI = (0.06, 1), H0 rejected.

TRUE: Data Analyst is more needed in the Academics industry.
Hypothesis: Data Scientist is more needed in the Medical industry.

- vs. Academics: X-squared = 1.7, p-value = 9.56e-02, 95% CI = (-0.02, 1), can't reject H0.
- vs. Computers: X-squared = 4.1, p-value = 2.14e-02, 95% CI = (0.02, 1), H0 rejected.
- vs. Finance: X-squared = 0.9, p-value = 1.73e-01, 95% CI = (-0.05, 1), can't reject H0.
- vs. Internet: X-squared = 0.5, p-value = 2.36e-01, 95% CI = (-0.08, 1), can't reject H0.

FALSE: Data Scientist has similar demand across these industries.
Hypothesis: Software Engineer is more needed in the Computers industry.

- vs. Academics: X-squared = 12.2, p-value = 2.39e-04, 95% CI = (0.12, 1), H0 rejected.
- vs. Finance: X-squared = 8.8, p-value = 1.51e-03, 95% CI = (0.07, 1), H0 rejected.
- vs. Internet: X-squared = 2.9, p-value = 4.32e-02, 95% CI = (0.02, 1), H0 rejected.
- vs. Medical: X-squared = 5.8, p-value = 8.17e-03, 95% CI = (0.05, 1), H0 rejected.

TRUE: Software Engineer is more needed in the Computers industry.
Hypothesis: Machine Learning Engineer is more needed in the Internet industry.

- vs. Academics: X-squared = 2, p-value = 7.77e-02, 95% CI = (0, 1), can't reject H0.
- vs. Computers: X-squared = 3.3, p-value = 9.66e-01, 95% CI = (-0.14, 1), can't reject H0.
- vs. Finance: X-squared = 6.2, p-value = 6.32e-03, 95% CI = (0.04, 1), H0 rejected.
- vs. Medical: X-squared = 3.3, p-value = 3.44e-02, 95% CI = (0.02, 1), H0 rejected.

FALSE: The demand for Machine Learning Engineers in the Internet industry is not significantly greater than in Academics and Computers.
Hypothesis: Data Scientist is more needed than the other jobs.

- vs. Data Analyst: X-squared = 53, p-value = 1.66e-13, 95% CI = (0.14, 1), H0 rejected.
- vs. Data Engineer: X-squared = 143.1, p-value = 2.74e-33, 95% CI = (0.24, 1), H0 rejected.
- vs. Software Engineer: X-squared = 29.5, p-value = 2.76e-08, 95% CI = (0.1, 1), H0 rejected.
- vs. ML Engineer: X-squared = 113.3, p-value = 9.36e-27, 95% CI = (0.22, 1), H0 rejected.

TRUE: Data Scientist is more needed than the other four jobs.
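Each comparison above follows the same pattern: a one-sided two-sample proportion test of \(H_0: p_1 = p_2\) against \(H_a: p_1 > p_2\). A sketch of the call with made-up counts (the `x` and `n` values here are hypothetical, not from the survey):

```r
# Hypothetical counts: respondents with a given job title in two industries
x <- c(40, 22)    # job-title counts in industry A and industry B (made up)
n <- c(100, 210)  # total respondents in each industry (made up)

# One-sided test: is the proportion in industry A greater than in B?
res <- prop.test(x, n, alternative = "greater")
res$p.value < 0.05  # reject H0 at alpha = 0.05?
```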
Salaries for these jobs vary widely from industry to industry.
What programming languages and IDEs do they use?
Survey questions Q7 (daily-used programming language), Q9 (IDE).
In the heatmap of programming languages below, some trends can be found:

- Python, R, and SQL are the three most popular languages in the Kaggle community.
- Statisticians use R more than Python and SQL.
- C++, Java, and JavaScript are mainly used by Software Engineers.
- Julia and Swift are barely used by Kaggle users.

In the heatmap of IDEs below, some trends can be found:

- Jupyter Notebook is the most popular IDE for Kaggle users.
- MATLAB is barely used in industry (the last five jobs).
- Statisticians use RStudio the most.
- VSCode is overall the second most commonly used IDE.

From the two plots above, Python is the most popular programming language for Kaggle users, and Jupyter Notebook (a Python-centric IDE) is the most commonly used IDE. Is there any correlation between them?
##
## Jupyter Python
## Data Analyst 167 186
## Data Engineer 62 80
## Data Scientist 357 407
## Machine Learning Engineer 85 100
## Software Engineer 134 179
## Statistician 9 19
## Student 292 396
\(H_0\): The distribution of Python users across job titles is the same as that of Jupyter Notebook users. \(H_a\): the two distributions are significantly different.
All job titles: X-squared = 6.1, p-value = 0.42, can’t reject H0.
Without Statistician: X-squared = 4.3, p-value = 0.51, can’t reject H0.
Without Statistician, Student: X-squared = 2, p-value = 0.76, can’t reject H0.
FALSE. There is no significant difference between the distribution of Python users and the distribution of Jupyter Notebook users across jobs. In other words, most people who use Python also use Jupyter Notebook as their IDE.
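As a check, the all-job-titles test can be reproduced directly from the counts printed in the table above:

```r
# Counts copied from the Jupyter/Python table above
tools <- matrix(
  c(167, 186,
     62,  80,
    357, 407,
     85, 100,
    134, 179,
      9,  19,
    292, 396),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Data Analyst", "Data Engineer", "Data Scientist",
      "Machine Learning Engineer", "Software Engineer",
      "Statistician", "Student"),
    c("Jupyter", "Python")))

# Chi-squared test of homogeneity across the 7 job titles (df = 6)
res <- chisq.test(tools)
res$p.value > 0.05  # TRUE: cannot reject H0
```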
Where do they get and share the knowledge?
Survey questions Q39 (share and deploy), Q40 (learning resources), Q42 (Media sources).
In the heatmap of learning platforms below, some trends can be found:

- Coursera is the most popular learning platform outside the university.
- Statisticians learn mainly from university courses.

In the heatmap of sharing platforms below, some trends can be found:

- GitHub is the most popular sharing platform.
- Kaggle, Colab, and personal blogs are used by a small proportion of Kaggle users.
- Students also answered this question.

In the heatmap of media sources below, some trends can be found:

- Blogs, Kaggle, and YouTube are the three most popular media sources.
- Statisticians use journal publications as their media source.

Special job title: Statistician

From all the analysis above, Statisticians show many preferences that differ from other job titles. For example:

- Industry: academic.
- They use R more frequently, along with RStudio.
- They learn mainly from university courses.
- They rely on journal publications for reports on data science topics.

Based on these features, the Statistician title very likely corresponds to professors and researchers at academic institutions (i.e., universities).
As for popular packages, one can easily name examples that are no strangers to data science practitioners. But the real question is: what can be learned from the survey data? Some conditional probabilities will highlight the findings.
The data is a subset of the survey answers to Q14 and Q16, which focus on visualization libraries and machine-learning-related libraries respectively. A set of niche job titles with limited numbers of responses is excluded, such as "DBA/Database Engineer", "Currently not employed", and "Statistician". The remaining valid responses are visualized as grouped barplots showing the count of respondents familiar with each package.
Additionally, as the number of machine learning libraries covered in the survey exceeds a reasonable number of distinguishable palette colors, only the top 10 libraries are kept in the visualization.
## Selecting by count
There are some obvious results worth mentioning. But before that, it is worth reiterating that the comparison is based on conditional probabilities, computed according to Bayes' theorem:
\[P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}\]
where \(A\) and \(B\) are events.
## value
## Q5 Altair Bokeh D3 js Geoplotlib Ggplot / ggplot2
## Business Analyst 3 6 8 7 29
## Data Analyst 2 16 13 18 104
## Data Engineer 2 5 14 8 25
## Data Scientist 15 45 26 19 183
## Machine Learning Engineer 4 7 7 6 17
## Program/Project Manager 0 2 11 7 33
## Research Scientist 2 10 6 5 67
## Software Engineer 4 13 30 6 35
## Student 1 11 17 20 129
## Sum 33 115 132 96 622
## value
## Q5 Leaflet / Folium Matplotlib Plotly / Plotly Express
## Business Analyst 4 56 32
## Data Analyst 12 136 72
## Data Engineer 6 57 28
## Data Scientist 31 325 176
## Machine Learning Engineer 3 87 39
## Program/Project Manager 7 48 18
## Research Scientist 6 113 39
## Software Engineer 6 139 33
## Student 5 286 88
## Sum 80 1247 525
## value
## Q5 Seaborn Shiny Sum
## Business Analyst 39 13 197
## Data Analyst 106 44 523
## Data Engineer 39 9 193
## Data Scientist 270 75 1165
## Machine Learning Engineer 66 5 241
## Program/Project Manager 34 8 168
## Research Scientist 68 25 341
## Software Engineer 83 7 356
## Student 156 26 739
## Sum 861 212 3923
## value
## Q5 Caret Huggingface Keras LightGBM Prophet PyTorch
## Business Analyst 7 3 21 5 6 18
## Data Analyst 28 1 30 11 9 27
## Data Engineer 1 4 24 7 3 27
## Data Scientist 60 55 166 99 56 133
## Machine Learning Engineer 3 29 61 23 7 58
## Program/Project Manager 5 2 17 3 1 18
## Research Scientist 13 15 50 12 2 60
## Software Engineer 6 9 62 12 5 77
## Student 25 14 118 15 4 109
## Sum 148 132 549 187 93 527
## value
## Q5 Scikit-learn TensorFlow Tidymodels Xgboost Sum
## Business Analyst 50 21 9 18 158
## Data Analyst 114 44 25 51 340
## Data Engineer 52 31 6 14 169
## Data Scientist 345 186 35 229 1364
## Machine Learning Engineer 81 69 1 43 375
## Program/Project Manager 44 17 6 17 130
## Research Scientist 97 58 11 39 357
## Software Engineer 119 90 2 31 413
## Student 223 150 10 64 732
## Sum 1125 666 105 506 4038
\[ \begin{split} P(\text{Student}|\text{Visualization}) &= \frac{739}{3923} \approx 0.1884\\ P(\text{Data Scientist}|\text{Visualization}) &= \frac{1165}{3923} \approx 0.2970 \\ P(\text{Student}|\text{Machine Learning}) &=\frac{732}{4038} \approx 0.1813\\ P(\text{Data Scientist}|\text{Machine Learning}) &=\frac{1364}{4038} \approx 0.3378 \end{split} \]
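These values are just the row sums from the tables above divided by the corresponding grand totals:

```r
# Row sums and grand totals from the visualization (Q14) and ML (Q16) tables
p_student_vis <- 739 / 3923   # P(Student | uses a visualization library)
p_ds_vis      <- 1165 / 3923  # P(Data Scientist | visualization)
p_student_ml  <- 732 / 4038   # P(Student | uses an ML library)
p_ds_ml       <- 1364 / 4038  # P(Data Scientist | ML)

round(c(p_student_vis, p_ds_vis, p_student_ml, p_ds_ml), 4)
# 0.1884 0.2970 0.1813 0.3378
```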
Altair, Bokeh, geoplotlib, Leaflet/Folium, Matplotlib, Plotly/Plotly Express, and Seaborn are categorized as Python-wise, while D3.js, ggplot2, Leaflet/Folium, Plotly/Plotly Express, and Shiny are categorized as R-wise. Note that there is some overlap among these packages, since several have multi-language support. Huggingface, Keras, LightGBM, Prophet, PyTorch, scikit-learn, TensorFlow, and XGBoost are Python-compatible; caret, Keras, LightGBM, Prophet, TensorFlow, tidymodels, and XGBoost are usable in R.

\[ \begin{split} P(\text{Python}|\text{Visualization}) &= (33+115+96+80+1247+525+861)/3923 \approx 0.7538\\ P(\text{R}|\text{Visualization}) &= (132+622+80+525+212)/3923 \approx 0.4005\\ P(\text{Python}|\text{Machine Learning}) &= (132+549+187+93+527+1125+666+506)/4038 \approx 0.9373\\ P(\text{R}|\text{Machine Learning}) &= (148+549+187+93+666+105+506)/4038 \approx 0.5582 \end{split} \]
There is more to be discovered here, depending on the perspective taken.
As the demand for computational power increases along with the amount of data involved in the data science industry, cloud computing is a topic that no data science practitioner can avoid. The major platforms considered here are AWS, Google Cloud (GCP), and Microsoft Azure.

Will a user's preference for cloud computing platforms affect his or her preference for other tools? For example, we want to know whether a dedicated AWS EC2 user will actually prefer AWS S3 over other storage products.
- Survey question Q29-A: computing products (Part_1)
- Survey question Q30: storage (Part_3, Part_4)
- Survey question Q31-A: ML products (Part_1)
\[\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}\]
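As a sanity check on the formula above, the statistic can be computed by hand for a small hypothetical 2x2 table and compared against chisq.test (continuity correction disabled so the two agree exactly):

```r
# Hypothetical observed counts for a 2x2 table (made up for illustration)
O <- matrix(c(20, 30,
              40, 10), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Chi-squared statistic computed directly from the formula above
x2_manual <- sum((O - E)^2 / E)

x2_builtin <- unname(chisq.test(O, correct = FALSE)$statistic)
all.equal(x2_manual, x2_builtin)  # TRUE
```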
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: aws_user$ec2 and aws_user$s3
## X-squared = 1532, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: aws_user$ec2 and aws_user$efs
## X-squared = 637.99, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: aws_user$ec2 and aws_user$sagemaker
## X-squared = 566.34, df = 1, p-value < 2.2e-16
What is the overall AWS usage percentage among DS practitioners? Is it the same for Google Cloud? What about Microsoft Azure?
\(H_0: p_A=p_B\), \(H_a: p_A\not=p_B\), where \(A\) and \(B\) can each be AWS, Azure, or GCP, and \(n_A\), \(n_B\) are the sample sizes of groups \(A\) and \(B\) respectively.
The test statistic (z-statistic) can be calculated as follows, with \(p\) the pooled proportion:
\[z=\frac{p_A-p_B}{\sqrt{p(1-p)/n_A+p(1-p)/n_B}}\]
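R's prop.test reports X-squared rather than z; without continuity correction the two are related by \(z^2 = X^2\). A sketch with hypothetical counts (not the survey's actual tallies):

```r
# Hypothetical usage counts for two platforms out of the same pool of respondents
x <- c(126, 74)
n <- c(530, 530)

p_hat  <- x / n
p_pool <- sum(x) / sum(n)  # pooled proportion under H0: p_A = p_B

# z-statistic from the formula above
z <- (p_hat[1] - p_hat[2]) /
  sqrt(p_pool * (1 - p_pool) / n[1] + p_pool * (1 - p_pool) / n[2])

# Without continuity correction, prop.test's X-squared equals z^2
res <- prop.test(x, n, correct = FALSE)
all.equal(z^2, unname(res$statistic))  # TRUE
```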
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 81.285, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.07606175 0.11865523
## sample estimates:
## prop 1 prop 2
## 0.2377358 0.1403774
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 1.8823, df = 1, p-value = 0.1701
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.005495458 0.031910552
## sample estimates:
## prop 1 prop 2
## 0.1403774 0.1271698
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 107.85, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.1315249 -0.0896072
## sample estimates:
## prop 1 prop 2
## 0.1271698 0.2377358
Honestly, the best way to approach this question would be logistic regression on the dependent variable salary, given its categorical nature. However, in order to experiment with a linear model, we take the bold step of transforming salary back to a continuous variable by randomly sampling values between the bounds of each pay level. Three typical languages used by data science practitioners are included as predictors as well: Python, R, and SQL.
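The bracket-to-continuous transformation described above can be sketched as uniform sampling between each respondent's bracket bounds; the `salary_lb`/`salary_ub` vectors here are hypothetical stand-ins for the parsed bracket columns.

```r
set.seed(511)

# Hypothetical bracket bounds for three respondents
salary_lb <- c(15000, 50000, 100000)
salary_ub <- c(49999, 99999, 199999)

# Draw one continuous salary uniformly within each respondent's bracket
salary <- runif(length(salary_lb), min = salary_lb, max = salary_ub)
all(salary >= salary_lb & salary <= salary_ub)  # TRUE
```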
## Start: AIC=3.1
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + sql:r + sql:python + r:python + python:r:sql +
## job:sql + job:r + job:python + job:experience + job:education
##
## Df Sum of Sq RSS AIC
## - job:sql 11 7.366 1498.5 -9.810
## - job:python 11 9.613 1500.7 -7.045
## - education:job 49 75.055 1566.2 -4.296
## - job:r 12 15.427 1506.5 -1.912
## - job:experience 65 105.598 1596.7 -0.661
## - python:r:sql 1 0.129 1491.2 1.259
## <none> 1491.1 3.099
## - gender 4 33.081 1524.2 35.584
## - age_group 10 49.159 1540.3 42.943
##
## Step: AIC=-9.81
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + r:sql + python:sql + python:r + job:r +
## job:python + job:experience + education:job + python:r:sql
##
## Df Sum of Sq RSS AIC
## - job:python 11 10.415 1508.9 -19.0305
## - education:job 49 74.735 1573.2 -18.0130
## - job:experience 66 104.636 1603.1 -17.2761
## - job:r 12 13.994 1512.5 -16.6593
## - python:r:sql 1 0.116 1498.6 -11.6670
## <none> 1498.5 -9.8096
## - gender 4 33.324 1531.8 22.7705
## - age_group 10 49.541 1548.0 30.2010
##
## Step: AIC=-19.03
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + r:sql + python:sql + python:r + job:r +
## job:experience + education:job + python:r:sql
##
## Df Sum of Sq RSS AIC
## - education:job 49 71.194 1580.1 -31.970
## - job:r 12 14.428 1523.3 -25.473
## - python:r:sql 1 0.448 1509.3 -20.483
## <none> 1508.9 -19.031
## - job:experience 67 115.338 1624.2 -17.131
## - gender 4 34.627 1543.5 14.831
## - age_group 10 49.253 1558.2 20.231
##
## Step: AIC=-31.97
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + r:sql + python:sql + python:r + job:r +
## job:experience + python:r:sql
##
## Df Sum of Sq RSS AIC
## - job:experience 69 114.216 1694.3 -41.205
## - job:r 12 17.367 1597.5 -35.802
## - python:r:sql 1 0.158 1580.2 -33.786
## <none> 1580.1 -31.970
## - education 6 12.916 1593.0 -28.949
## - gender 4 35.101 1615.2 0.567
## - age_group 10 45.898 1626.0 0.859
##
## Step: AIC=-41.21
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + r:sql + python:sql + python:r + job:r +
## python:r:sql
##
## Df Sum of Sq RSS AIC
## - job:r 12 18.759 1713.1 -44.890
## - python:r:sql 1 0.075 1694.4 -43.124
## <none> 1694.3 -41.205
## - education 6 14.747 1709.0 -37.216
## - gender 4 40.921 1735.2 -5.174
## - age_group 10 62.087 1756.4 5.194
## - experience 6 76.442 1770.8 28.212
##
## Step: AIC=-44.89
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + r:sql + python:sql + python:r + python:r:sql
##
## Df Sum of Sq RSS AIC
## - python:r:sql 1 0.136 1713.2 -46.744
## <none> 1713.1 -44.890
## - education 6 13.016 1726.1 -42.925
## - gender 4 37.809 1750.9 -12.612
## - age_group 10 63.458 1776.5 2.220
## - job 12 79.981 1793.0 15.301
## - experience 6 73.691 1786.8 20.817
##
## Step: AIC=-46.74
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + r:sql + python:sql + python:r
##
## Df Sum of Sq RSS AIC
## - r:sql 1 0.023 1713.2 -48.719
## - python:r 1 1.249 1714.5 -47.399
## <none> 1713.2 -46.744
## - education 6 12.952 1726.2 -44.848
## - python:sql 1 4.106 1717.3 -44.328
## - gender 4 37.739 1750.9 -14.542
## - age_group 10 63.863 1777.1 0.782
## - job 12 79.876 1793.1 13.332
## - experience 6 75.066 1788.3 20.377
##
## Step: AIC=-48.72
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + python:sql + python:r
##
## Df Sum of Sq RSS AIC
## - python:r 1 1.329 1714.5 -49.289
## <none> 1713.2 -48.719
## - education 6 12.932 1726.2 -46.845
## - python:sql 1 4.113 1717.3 -46.295
## - gender 4 37.720 1750.9 -16.539
## - age_group 10 63.965 1777.2 -1.089
## - job 12 79.955 1793.2 11.436
## - experience 6 75.178 1788.4 18.515
##
## Step: AIC=-49.29
## log(salary) ~ age_group + gender + education + job + experience +
## python + r + sql + python:sql
##
## Df Sum of Sq RSS AIC
## - r 1 0.588 1715.1 -50.656
## <none> 1714.5 -49.289
## - education 6 13.274 1727.8 -47.060
## - python:sql 1 4.123 1718.7 -46.857
## - gender 4 38.204 1752.8 -16.630
## - age_group 10 64.559 1779.1 -1.094
## - job 12 79.665 1794.2 10.505
## - experience 6 78.185 1792.7 20.983
##
## Step: AIC=-50.66
## log(salary) ~ age_group + gender + education + job + experience +
## python + sql + python:sql
##
## Df Sum of Sq RSS AIC
## <none> 1715.1 -50.656
## - python:sql 1 4.322 1719.5 -48.013
## - education 6 14.190 1729.3 -47.455
## - gender 4 37.922 1753.1 -18.308
## - age_group 10 64.867 1780.0 -2.166
## - job 12 79.080 1794.2 8.508
## - experience 6 79.142 1794.3 20.572
##
## Call:
## lm(formula = log(salary) ~ age_group + gender + education + job +
## experience + python + sql + python:sql, data = lm_dat)
##
## Coefficients:
## (Intercept)
## 10.200984
## age_group22-24
## 0.240567
## age_group25-29
## 0.485461
## age_group30-34
## 0.774225
## age_group35-39
## 0.823671
## age_group40-44
## 0.766916
## age_group45-49
## 0.799968
## age_group50-54
## 0.981098
## age_group55-59
## 0.787097
## age_group60-69
## 0.562732
## age_group70+
## 0.358895
## genderNonbinary
## -0.559571
## genderPrefer not to say
## -0.478275
## genderPrefer to self-describe
## -0.548671
## genderWoman
## -0.312198
## educationDoctoral degree
## 0.209169
## educationI prefer not to answer
## -0.455451
## educationMaster’s degree
## 0.115979
## educationNo formal education past high school
## -0.109712
## educationProfessional doctorate
## 0.195027
## educationSome college/university study without earning a bachelor’s degree
## -0.081503
## jobData Analyst
## 0.005641
## jobData Engineer
## 0.134129
## jobData Scientist
## 0.332849
## jobDBA/Database Engineer
## -0.075810
## jobDeveloper Relations/Advocacy
## 1.308019
## jobMachine Learning Engineer
## 0.349711
## jobOther
## 0.068385
## jobProduct Manager
## 0.769284
## jobProgram/Project Manager
## 0.404084
## jobResearch Scientist
## -0.159111
## jobSoftware Engineer
## 0.179390
## jobStatistician
## -0.340587
## experience1-3 years
## 0.085388
## experience10-20 years
## 0.645163
## experience20+ years
## 0.691455
## experience3-5 years
## 0.285039
## experience5-10 years
## 0.494777
## experienceI have never written code
## -0.130967
## python
## -0.130113
## sql
## -0.200599
## python:sql
## 0.271468
Besides the summary() table, an analysis of variance table shows which terms are significant:
## Analysis of Variance Table
##
## Response: log(salary)
## Df Sum Sq Mean Sq F value Pr(>F)
## age_group 10 160.14 16.0136 16.8339 < 2.2e-16 ***
## gender 4 71.08 17.7694 18.6796 4.666e-15 ***
## education 6 34.20 5.6995 5.9915 3.246e-06 ***
## job 12 128.81 10.7340 11.2839 < 2.2e-16 ***
## experience 6 83.35 13.8922 14.6039 2.415e-16 ***
## python 1 0.01 0.0077 0.0081 0.92829
## sql 1 0.20 0.2040 0.2145 0.64334
## python:sql 1 4.32 4.3217 4.5430 0.03319 *
## Residuals 1803 1715.14 0.9513
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# invalidate cache when the package version changes
knitr::opts_chunk$set(cache = TRUE,
echo = FALSE,
eval = TRUE,
tidy = FALSE,
warning = FALSE,
cache.extra = packageVersion('tidyverse'))
options(htmltools.dir.version = FALSE)
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, ggthemes, latex2exp, glue,
hrbrthemes, plotly, stringr, DT, extrafont,
tidymodels)
set.seed(511)
ggplot2::theme_set(
theme_fivethirtyeight() +
theme(
text = element_text(family = "Roboto Condensed"),
title = element_text(size = 14),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(size = 10),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.minor.x = element_blank()
)
)
Since the same code will be reused in Problem 1-c, I will just use this part as setup.
col_names <- names(read_csv(
"data/kaggle_survey_2021_responses.csv",
n_max=0))
dat <- read_csv(
"data/kaggle_survey_2021_responses.csv",
col_names = col_names, skip=2)
dat <- dat %>%
filter(Q3=="United States of America" )
job.dat <- dat %>%
filter(Q5 %in% c("Data Analyst",
"Data Engineer",
"Data Scientist",
"Machine Learning Engineer",
"Software Engineer",
"Statistician",
"Student")) %>%
mutate(Q25 = str_remove_all(Q25, "[$,]")) %>%
mutate(Q25 = str_replace(Q25, ">1000000", "1000000-2000000")) %>%
separate(Q25, into = c("salary_lb", "salary_ub"), sep = "-") %>%
mutate(salary_lb = as.numeric(salary_lb)) %>%
mutate(salary_ub = as.numeric(salary_ub))
# Q1
jtitle <- sort(table(dat$Q5), decreasing = T) %>%
as.data.frame() %>%
as_tibble()  # as.tibble() is deprecated
jtitle <- rename(jtitle, `Job Title` = Var1)
ggplot(jtitle, aes(x="", y=Freq, fill=`Job Title`)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
theme_void()
# Q2
# To change salary categories into FACTOR dtype with descending labels
poverty = c("$0-999", "1,000-1,999" , "2,000-2,999", "3,000-3,999", "4,000-4,999", "5,000-7,499", "7,500-9,999",
"10,000-14,999")
low = c("15,000-19,999", "20,000-24,999", "25,000-29,999", "30,000-39,999", "40,000-49,999")
medium = c("50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999")
high = c("100,000-124,999", "125,000-149,999", "150,000-199,999")
very_high = c("200,000-249,999", "250,000-299,999", "300,000-499,999")
highest = c("$500,000-999,999", ">$1,000,000")
dat$Q25[dat$Q25 %in% poverty] <- "poverty"
dat$Q25[dat$Q25 %in% low] <- "low"
dat$Q25[dat$Q25 %in% medium] <- "medium"
dat$Q25[dat$Q25 %in% high] <- "high"
dat$Q25[dat$Q25 %in% very_high] <- "very high"
dat$Q25[dat$Q25 %in% highest] <- "highest"
dat$Q25 = factor(dat$Q25, levels = c("poverty", "low", "medium", "high", "very high", "highest"), ordered = T)
data_side <- c("Data Scientist", "Data Analyst", "Business Analyst", "Data Engineer", "Statistician", "DBA/Database Engineer")
swe_side <- c("Software Engineer", "Machine Learning Engineer", "Program/Project Manager", "Product Manager")
academia <- c("Student", "Other", "Research Scientist")
dat[dat$Q5 %in% data_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
ggplot( aes(x=Q5, y=Q25, color = Q5)) +
geom_count() +
ggtitle("Two-Way Salary Visualizations: Data-Oriented Jobs") +
xlab("") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
dat[dat$Q5 %in% swe_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
ggplot( aes(x=Q5, y=Q25, color = Q5)) +
geom_count() +
ggtitle("Two-Way Salary Visualizations: Engineering-Oriented Jobs") +
xlab("") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
dat[dat$Q5 %in% academia & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
ggplot( aes(x=Q5, y=Q25, color = Q5)) +
geom_count() +
ggtitle("Two-Way Salary Visualizations: Academic Jobs and Others") +
xlab("") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
expectedcounts <- function(A){
r <- rowSums(A)
c <- colSums(A)
N = sum(A)
expected <- outer(r,c)/N
return(expected)
}
q2_table.1 = table(dat$Q25[dat$Q5 %in% data_side], dat$Q5[dat$Q5 %in% data_side])
q2_table.df = q2_table.1 %>% as.data.frame() %>% as_tibble()
q2_expectedcounts <- expectedcounts(q2_table.1) %>% as.data.frame() %>% as_tibble()
ggplot(q2_table.df, aes(x = Var2, y = Freq, fill = Var1)) +
geom_col(position = "dodge") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
(chisq.test(q2_table.1, simulate.p.value = T, B = 10000))
q2_table.2 = table(dat$Q25[dat$Q5 %in% swe_side], dat$Q5[dat$Q5 %in% swe_side])
q2_table.df2 = q2_table.2 %>% as.data.frame() %>% as_tibble()
q2_expectedcounts2 <- expectedcounts(q2_table.2) %>% as.data.frame() %>% as_tibble()
ggplot(q2_table.df2, aes(x = Var2, y = Freq, fill = Var1)) +
geom_col(position = "dodge") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
(chisq.test(q2_table.2, simulate.p.value = T, B = 10000))
# Q3
edu <- sort(table(dat$Q4), decreasing = T) %>%
as.data.frame() %>%
as_tibble()
edu <- rename(edu, `Degree` = Var1)
ggplot(edu, aes(x="", y=Freq, fill=`Degree`)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
theme_void()
Q3_temp <- sort(table(dat$Q4), decreasing = T)
(Q3_temp <- Q3_temp / sum(Q3_temp))
edu.1 <- sort(table(dat[dat$Q5 == "Data Scientist",]$Q4), decreasing = T) %>%
as.data.frame() %>%
as_tibble()
edu.1 <- rename(edu.1, `Degree` = Var1)
ggplot(edu.1, aes(x="", y=Freq, fill=`Degree`)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
theme_void()
EDU <- table(dat$Q5, dat$Q4) %>%
as.data.frame() %>%
as_tibble()
EDU <- rename(EDU, `Job Title` = Var1)
EDU <- rename(EDU, `Degree` = Var2)
# ggplot(EDU, aes(x="", y=Freq, fill=`Degree`)) +
# geom_bar(stat="identity", width=1, color="white") +
# coord_polar("y", start=0) +
# theme_void() +
# facet_wrap(vars(`Job Title`), ncol = 5) +
# ggtitle("Pie Chart of Degrees by Each Job Title") +
# theme(legend.position = "bottom")
# Pie chart of the degree (Q4) distribution for a given job title
print_pie_chart <- function(job){
edu <- sort(table(dat[dat$Q5 == job,]$Q4), decreasing = TRUE) %>%
as.data.frame() %>%
as_tibble()
edu <- rename(edu, `Degree` = Var1)
print(ggplot(edu, aes(x="", y=Freq, fill=`Degree`)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
ggtitle(job) +
theme_void())
}
jobs <- unique(dat$Q5)
print_pie_chart("Software Engineer")
print_pie_chart("Data Engineer")
# Q4: gender (Q2) vs salary (Q25), overall and within coding-experience groups
table(dat$Q2, dat$Q25)
# Q6 records coding experience, so iterate over experience groups
for (expgroup in unique(dat$Q6)){
if (expgroup != "I have never written code"){
print(expgroup)
print(table(dat$Q2[dat$Q6 == expgroup], dat$Q25[dat$Q6 == expgroup]))
}
}
# Q5: tool/skill columns (Q7, Q9, Q12, Q14, Q16-Q19) vs salary by job title
skill.set <- job.dat %>%
filter(Q5 != "Other") %>%
select(c(Q5, starts_with("Q7_"), starts_with("Q9_"),
starts_with("Q12_"), starts_with("Q14_"),
starts_with("Q16_"), starts_with("Q17_"),
starts_with("Q18_"), starts_with("Q19_"),
salary_lb)) %>%
mutate(Total = "`Total`") %>%
gather("fake_key", "skillset",
-c(Q5, salary_lb), na.rm = T) %>%
filter(!skillset %in% c("None", "Other")) %>%
rename(title = Q5) %>%
group_by(title, skillset) %>%
summarise(n = n(),
salary_mean = round(mean(salary_lb, na.rm = T)),
salary_sd = round(sd(salary_lb, na.rm = T)),
.groups = "drop") %>%
group_by(title) %>%
mutate(prop = round(n * 100 / max(n), 1)) %>%
filter(prop >= 0.1) %>%
select(-n) %>%
arrange(title, desc(prop))
datatable(skill.set, filter = 'top', width = 800)
# Q6: industry (Q20) composition and salaries by job title
industry.dat <- job.dat %>%
filter(Q5 != "Student") %>%
select(Q5, Q20, salary_lb, salary_ub) %>%
mutate(Q20 = case_when(
Q20 == "Academics/Education" ~ "Academics",
Q20 == "Accounting/Finance" ~ "Finance",
Q20 == "Computers/Technology" ~ "Computers",
Q20 == "Medical/Pharmaceutical" ~ "Medical",
Q20 == "Online Service/Internet-based Services" ~ "Internet",
TRUE ~ Q20
)) %>%
filter(Q20 %in% c("Academics", "Finance", "Computers",
"Medical", "Internet"))
p <- industry.dat %>%
count(Q5, Q20) %>%
mutate(Q20 = fct_reorder(Q20, n, .fun="sum")) %>%
rename(title=Q5, Industry=Q20, count=n) %>%
ggplot(aes(x=title, y=count)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap(~ Industry) +
labs(
title = "Users' work industry",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p)
ind.dis <- industry.dat %>%
filter(Q5 != "Statistician") %>%
rename(Title = Q5, Industry = Q20)
ind.dis %>%
group_by(Industry) %>%
count(Title) %>%
mutate(Proportion = round(prop.table(n), 3)) %>%
select(-n) %>%
pivot_wider(names_from = Industry, values_from = Proportion)
aca.dis <- ind.dis %>% filter(Industry == "Academics")
com.dis <- ind.dis %>% filter(Industry == "Computers")
fin.dis <- ind.dis %>% filter(Industry == "Finance")
int.dis <- ind.dis %>% filter(Industry == "Internet")
med.dis <- ind.dis %>% filter(Industry == "Medical")
ds.dis <- ind.dis %>% filter(Title == "Data Scientist")
da.dis <- ind.dis %>% filter(Title == "Data Analyst")
de.dis <- ind.dis %>% filter(Title == "Data Engineer")
# One-sided two-sample proportion test: is `title` over-represented in
# target.dis relative to other.dis? Returns a formatted summary string.
my.prop.test <- function(target.dis, other.dis, title) {
target <- sum(target.dis$Title == title)
other <- sum(other.dis$Title == title)
t.res <- prop.test(c(target, other),
c(nrow(target.dis), nrow(other.dis)),
alternative = "greater",
correct = TRUE)
return(paste0("X-squared = ", round(t.res$statistic,1),
", p-value = ", formatC(t.res$p.value,
format = "e", digits = 2),
", 95% CI = (", round(t.res$conf.int[1], 2),
", ", round(t.res$conf.int[2], 2),"), ",
if_else(t.res$p.value < 0.05, "H0 rejected",
"can't reject H0")))
}
# Same test, comparing two titles within a single industry subset
my.prop.test2 <- function(ind.dis, title1, title2) {
target <- sum(ind.dis$Title == title1)
other <- sum(ind.dis$Title == title2)
t.res <- prop.test(c(target, other),
c(nrow(ind.dis), nrow(ind.dis)),
alternative = "greater",
correct = TRUE)
return(paste0("X-squared = ", round(t.res$statistic,1),
", p-value = ", formatC(t.res$p.value,
format = "e", digits = 2),
", 95% CI = (", round(t.res$conf.int[1], 2),
", ", round(t.res$conf.int[2], 2),"), ",
if_else(t.res$p.value < 0.05, "H0 rejected",
"can't reject H0")))
}
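As a usage sketch for `my.prop.test()` (hand-made counts, not survey data): with 30 of 100 respondents holding the title in the first subset versus 15 of 100 in the second, the one-sided test should flag the difference:

```r
# Hypothetical toy subsets, for illustration only
toy.target <- tibble(Title = c(rep("Data Scientist", 30), rep("Data Analyst", 70)))
toy.other <- tibble(Title = c(rep("Data Scientist", 15), rep("Data Analyst", 85)))
my.prop.test(toy.target, toy.other, "Data Scientist")
```

Because `alternative = "greater"`, H0 is rejected only when the first subset's proportion of the given title is significantly larger than the second's.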
p <- industry.dat %>%
filter(salary_lb != 1000000) %>%
mutate(Q20 = fct_reorder(Q20, salary_lb, .fun='length')) %>%
ggplot(aes(x=Q20, y=salary_lb)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Q5) +
labs(
title = "Users' salary vs industry",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
# Q7: favorite programming languages (Q7) and IDEs (Q9) by job title
programming <- job.dat %>%
select(c(Q5, starts_with("Q7_"))) %>%
gather("fake_key", "language", -Q5, na.rm = T) %>%
rename(title = Q5) %>%
select(-fake_key) %>%
filter(!language %in% c("None", "Other")) %>%
count(title, language, .drop = FALSE) %>%
complete(title, language) %>%
replace_na(list(n = 0)) %>%
group_by(title) %>%
mutate(prop = prop.table(n))
p <- programming %>%
mutate(text = paste0("Language: ", language, "\n",
"Job title: ", title, "\n",
"Count: ", n, "\n",
"Proportion: ", round(prop, 3))) %>%
ggplot(aes(language, title, fill=prop, text=text)) +
geom_tile() +
scale_fill_gradient(low="white", high="royalblue") +
labs(
title = "Users' favorite programming language",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle=90, hjust=1),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
ide <- job.dat %>%
select(c(Q5, starts_with("Q9_"))) %>%
gather("fake_key", "IDE", -Q5, na.rm = T) %>%
rename(title = Q5) %>%
select(-fake_key) %>%
mutate(IDE = case_when(
IDE == "Visual Studio Code (VSCode)" ~ "VSCode",
IDE == "Jupyter (JupyterLab, Jupyter Notebooks, etc)" ~ "Jupyter Notebook",
TRUE ~ IDE
)) %>%
filter(!IDE %in% c("None", "Other")) %>%
count(title, IDE, .drop = FALSE) %>%
complete(title, IDE) %>%
replace_na(list(n = 0)) %>%
group_by(title) %>%
mutate(prop = prop.table(n))
p <- ide %>%
mutate(text = paste0("IDE: ", IDE, "\n",
"Job title: ", title, "\n",
"Count: ", n, "\n",
"Proportion: ", round(prop, 3))) %>%
ggplot(aes(IDE, title, fill=prop, text=text)) +
geom_tile() +
scale_fill_gradient(low="white", high="royalblue") +
labs(
title = "Users' favorite IDE",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle=90, hjust=1),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
pj.dat <- job.dat %>%
select(c(Q5, Q7_Part_1, Q9_Part_1, Q9_Part_11)) %>%
rename(Title = Q5) %>%
mutate(Python = !is.na(Q7_Part_1),
Jupyter = !is.na(Q9_Part_1) | !is.na(Q9_Part_11)) %>%
select(Title, Python, Jupyter) %>%
pivot_longer(cols = c(Python, Jupyter)) %>%
filter(value == TRUE)
# Chi-squared test of Python usage vs Jupyter usage across job titles,
# optionally excluding some titles
py.ju.test <- function(excludes) {
pj.t.dat <- pj.dat %>% filter(!Title %in% excludes)
print(table(pj.t.dat$name, pj.t.dat$Title))
pj.t.res <- chisq.test(pj.t.dat$name, pj.t.dat$Title,
simulate.p.value = TRUE)
return(paste0("X-squared = ", round(pj.t.res$statistic, 1),
", p-value = ", round(pj.t.res$p.value, 2),
if_else(pj.t.res$p.value < 0.05, ", H0 rejected",
", can't reject H0")))
}
table(pj.dat$Title, pj.dat$name)
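`py.ju.test()` is defined but not called in this chunk; an illustrative invocation (the excluded titles here are arbitrary) would be:

```r
py.ju.test(excludes = c("Other", "Currently not employed"))
```

This prints the Python/Jupyter-by-title contingency table and returns a one-line chi-squared summary with a simulated p-value.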
# Q8: learning platforms (Q40), sharing/deployment platforms (Q39), and media sources (Q42)
learning_platform <- job.dat %>%
select(c(Q5, starts_with("Q40_"))) %>%
gather("fake_key", "learning", -Q5, na.rm = T) %>%
rename(title = Q5) %>%
select(-fake_key) %>%
mutate(learning = case_when(
learning == "Cloud-certification programs (direct from AWS, Azure, GCP, or similar)" ~ "Cloud-certif Programs",
learning == "University Courses (resulting in a university degree)" ~ "University",
TRUE ~ learning
)) %>%
filter(!learning %in% c("None", "Other")) %>%
count(title, learning, .drop = FALSE) %>%
complete(title, learning) %>%
replace_na(list(n = 0)) %>%
group_by(title) %>%
mutate(prop = prop.table(n))
p <- learning_platform %>%
mutate(text = paste0("Platform: ", learning, "\n",
"Job title: ", title, "\n",
"Count: ", n, "\n",
"Proportion: ", round(prop, 3))) %>%
ggplot(aes(learning, title, fill=prop, text=text)) +
geom_tile() +
scale_fill_gradient(low="white", high="royalblue") +
labs(
title = "Users' favorite learning platforms",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle=90, hjust=1),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
share_deploy <- job.dat %>%
select(c(Q5, starts_with("Q39_"))) %>%
gather("fake_key", "share", -Q5, na.rm = T) %>%
rename(title = Q5) %>%
select(-fake_key) %>%
mutate(share = case_when(
share == "I do not share my work publicly" ~ "\'PRIVATE\'",
TRUE ~ share
)) %>%
filter(!share %in% c("Other")) %>%
count(title, share, .drop = FALSE) %>%
complete(title, share) %>%
replace_na(list(n = 0)) %>%
group_by(title) %>%
mutate(prop = prop.table(n))
p <- share_deploy %>%
mutate(text = paste0("Platform: ", share, "\n",
"Job title: ", title, "\n",
"Count: ", n, "\n",
"Proportion: ", round(prop, 3))) %>%
ggplot(aes(share, title, fill=prop, text=text)) +
geom_tile() +
scale_fill_gradient(low="white", high="royalblue") +
labs(
title = "Users' favorite share platforms",
x = "",
y = "",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle=90, hjust=1),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
media_source <- job.dat %>%
select(c(Q5, starts_with("Q42_"))) %>%
gather("fake_key", "media", -Q5, na.rm = T) %>%
rename(title = Q5) %>%
select(-fake_key) %>%
filter(!media %in% c("None", "Other")) %>%
count(title, media, .drop = FALSE) %>%
complete(title, media) %>%
replace_na(list(n = 0)) %>%
group_by(title) %>%
mutate(prop = prop.table(n)) %>%
separate(media, into = c("media", "media_suffix"), sep = " \\(")
p <- media_source %>%
mutate(text = paste0("Platform: ", media, "\n",
"Job title: ", title, "\n",
"Count: ", n, "\n",
"Proportion: ", round(prop, 3))) %>%
ggplot(aes(media, title, fill=prop, text=text)) +
geom_tile() +
scale_fill_gradient(low="white", high="royalblue") +
labs(
title = "Users' favorite media source",
caption = glue("Author: celeritasML
Source: Kaggle")) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle=90, hjust=1),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
# Q9: favorite visualization (Q14) and ML (Q16) libraries
viz_lib <- dat %>%
select(Q5, starts_with("Q14")) %>%
select(-c(Q14_Part_11, Q14_OTHER))
viz_lib <- viz_lib %>%
pivot_longer(cols=starts_with("Q14")) %>%
select(-name) %>%
drop_na() %>%
filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
"Developer Relations/Advocacy",
"Currently not employed",
"Statistician", "Product Manager")))
ggplot(viz_lib) +
geom_bar(aes(y = value, fill = value)) +
facet_wrap(~ Q5) +
scale_fill_brewer(palette = "Spectral") +
labs(
title = "Data science practitioners' favorite viz libraries",
x = "",
y = "",
caption = glue("Author: celeritasML
Source: Kaggle")
) +
theme(
axis.ticks.x = element_line(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(size = 6),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ml_lib <- dat %>%
select(Q5, starts_with("Q16")) %>%
select(-c(Q16_Part_17, Q16_OTHER))
ml_lib <- ml_lib %>%
pivot_longer(cols=starts_with("Q16")) %>%
select(-name) %>%
drop_na() %>%
filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
"Developer Relations/Advocacy",
"Currently not employed",
"Statistician", "Product Manager")))
top_10_ml <- ml_lib %>%
group_by(value) %>%
summarize(count = n()) %>%
slice_max(count, n = 10)
ml_lib <- ml_lib %>%
filter(value %in% top_10_ml$value)
ggplot(ml_lib) +
geom_bar(aes(y = value, fill = value)) +
facet_wrap(~ Q5) +
scale_fill_brewer(palette = "Spectral") +
labs(
title = "Data science practitioners' favorite ML libraries",
x = "",
y = "",
caption = glue("Author: celeritasML
Source: Kaggle")
) +
theme(
axis.ticks.x = element_line(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(size = 6),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
viz_lib %>%
table() %>%
addmargins() %>%
print()
ml_lib %>%
table() %>%
addmargins() %>%
print()
# Q10: co-usage of AWS services (EC2, S3, SageMaker, etc.)
aws_user <- tibble(
ec2 = dat$Q29_A_Part_1,
s3 = dat$Q30_A_Part_3,
efs = dat$Q30_A_Part_4,
sagemaker = dat$Q31_A_Part_1,
redshift = dat$Q32_A_Part_11,
aurora = dat$Q32_A_Part_12,
rds = dat$Q32_A_Part_13,
dynamodb = dat$Q32_A_Part_14
) %>%
mutate(across(everything(), ~ if_else(is.na(.x), 0, 1)))
chisq.test(aws_user$ec2, aws_user$s3)
chisq.test(aws_user$ec2, aws_user$efs)
chisq.test(aws_user$ec2, aws_user$sagemaker)
# Q11: pairwise comparison of AWS, Azure, and GCP usage rates
cloud_comp <- tibble(
aws_usage = dat$Q27_A_Part_1,
azure_usage = dat$Q27_A_Part_2,
gcp_usage = dat$Q27_A_Part_3
) %>%
mutate(across(everything(), ~ !is.na(.x)))
prop.test(c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)),
c(nrow(cloud_comp), nrow(cloud_comp)),
alternative = "two.sided",
correct = TRUE)
prop.test(c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)),
c(nrow(cloud_comp), nrow(cloud_comp)),
alternative = "two.sided",
correct = TRUE)
prop.test(c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)),
c(nrow(cloud_comp), nrow(cloud_comp)),
alternative = "two.sided",
correct = TRUE)
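The three pairwise tests above are run without a multiplicity correction; a sketch using base R's `pairwise.prop.test()` with a Holm adjustment (same counts, assuming `cloud_comp` as built above) would control the family-wise error rate:

```r
usage.counts <- c(AWS = sum(cloud_comp$aws_usage),
                  Azure = sum(cloud_comp$azure_usage),
                  GCP = sum(cloud_comp$gcp_usage))
pairwise.prop.test(usage.counts, rep(nrow(cloud_comp), 3),
                   p.adjust.method = "holm")
```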
# Q12: linear model of log(salary) on demographics, job, experience, and languages
lm_dat <- dat %>%
select(Q1, Q2, Q4, Q5, Q6,
Q7_Part_1, Q7_Part_2, Q7_Part_3,
Q25) %>%
drop_na(Q25) %>%
rename(age_group = Q1,
gender = Q2,
education = Q4,
job = Q5,
experience = Q6,
salary = Q25,
python = Q7_Part_1,
r = Q7_Part_2,
sql = Q7_Part_3)
set.seed(511)
# - poverty: below 40,000
# - low: 40,000 to 79,999
# - medium: 80,000 to 124,999
# - high: 125,000 to 199,999
# - very high: 200,000 to 499,999
# - highest: >= 500,000
lm_dat <- lm_dat %>%
rowwise() %>%
mutate(salary = case_when(
salary == "poverty" ~ sample(1:39999, 1),
salary == "low" ~ sample(40000:79999, 1),
salary == "medium" ~ sample(80000:124999, 1),
salary == "high" ~ sample(125000:199999, 1),
salary == "very high" ~ sample(200000:499999, 1),
salary == "highest" ~ sample(500000:2000000, 1)
)) %>%
ungroup()
ggplot(lm_dat, aes(x=salary)) +
geom_histogram() +
labs(
title = "Histogram of salaries",
caption = glue("Author: celeritasML
Source: Kaggle"))
ggplot(lm_dat, aes(x=log(salary))) +
geom_histogram() +
labs(
title = "Histogram of log(salary)",
caption = glue("Author: celeritasML
Source: Kaggle"))
lm_dat <- lm_dat %>%
mutate(python = if_else(is.na(python), 0, 1),
r = if_else(is.na(r), 0, 1),
sql = if_else(is.na(sql), 0, 1))
model1 <- lm(log(salary) ~ . + sql:r + sql:python + r:python + python:r:sql +
job:sql + job:r + job:python + job:experience +
job:education, data=lm_dat)
MASS::stepAIC(model1)
model2 <- lm(log(salary) ~ age_group + gender + education + job + experience +
python + sql + python:sql, data = lm_dat)
anova(model2)
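As an optional follow-up (standard base-R diagnostics, not part of the original analysis), residual plots can be used to eyeball the linearity and normality assumptions behind `model2`:

```r
# Residuals vs fitted, normal Q-Q, scale-location, and leverage plots
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
```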